Abstract

In this paper, we investigate different approaches for dialect
identification in Arabic broadcast speech. These methods are
based on phonetic and lexical features obtained from a speech
recognition system, and bottleneck features using the i-vector
framework. We studied both generative and discriminative
classifiers, and we combined these features using a multi-class
Support Vector Machine (SVM). We validated our results on an
Arabic/English language identification task, with an accuracy
of 100%. We also evaluated these features in a binary
classifier to discriminate between Modern Standard Arabic (MSA)
and Dialectal Arabic, with an accuracy of 100%. We further
report results using the proposed methods to discriminate
between the five most widely used dialects of Arabic, namely
Egyptian, Gulf, Levantine, North African, and MSA, with an
accuracy of 59.2%. We discuss dialect identification errors in
the context of dialect code-switching between Dialectal Arabic
and MSA, and compare the error pattern between manually
labeled data and the output from our classifier. All the data
used in our experiments have been released to the public as a
language identification corpus.

Index Terms: Dialect Identification, Vector Space Modelling 

1. Introduction 

The task of Dialect Identification (DID) is a special case of the
more general problem of Language Identification (LID). LID
refers to the process of automatically identifying the language
class for a given speech segment or text document. DID is
arguably a more challenging problem than LID, since it consists
of identifying the different dialects within the same language
class. The importance of addressing DID can be gauged from the
growing interest in it within the Automatic Speech Recognition
(ASR) community [1]. A good DID system can facilitate the
identification of dialectal segments from an untranscribed
mixed-speech dataset. This process can help reduce the ASR word
error rate (WER) for dialectal data by training ASR systems for
each dialect, or by adapting the ASR models to a particular dialect.

The natural language processing (NLP) community has
aggregated dialectal Arabic into five regional language groups:
Egyptian (EGY), North African or Maghrebi (NOR), Gulf or
Arabian Peninsula (GLF), Levantine (LAV), and Modern Standard
Arabic (MSA). An objective comparison of the varieties
of Arabic dialects could potentially lead to the conclusion
that Arabic dialects are historically related, but not
synchronically, and are mutually unintelligible languages like
English and Dutch. Normal vernacular can be difficult to understand
across different Arabic dialects [?]. Arabic dialects are thus
sufficiently distinctive, and it is reasonable to regard the DID
task in Arabic as similar to the LID task in other languages.
Table 1 shows two phrases across the different dialects. It is
clear from this example that there are lexical variations across
the dialects, which motivates us to consider lexical features.

Two broad LID approaches have been investigated in the
literature: low-level acoustic features, and high-level phonetic
and lexical features. In the lexical area, words, roots,
morphology, and grammars [2, 3] have been studied. Acoustic features
such as shifted delta cepstral coefficients [17] and prosodic
features [5] using Gaussian mixture models (GMMs), i-vector
representations and support vector machine (SVM) classifiers [17]
have been shown to be effective for LID. More recent work
explored the use of frame-by-frame phone posteriors (PLLRs) [6]
as new features for LID. New subspace approaches based on
non-negative factor analysis (NFA) for GMM weight
decomposition and adaptation [7] were also applied to both LID and
DID tasks. GMM weight adaptation subspaces seem to provide
complementary information to the classical i-vector framework.
Finally, phoneme sequence modeling and its n-gram subspace
have been studied for both Arabic DID [8] and LID [9].


EGY       GLF      LAV             MSA        NOR       Translation
AzAYk     A$lwnk   kyfk / ASlwnk   kyf HAlk   wA$ rAk   How are you?
Ant fyn   wynk     wynk            Ayn Ant    wyn rAk   Where are you?


Table 1: Lexical examples in Arabic and Buckwalter format. 

In this paper we investigate three Vector Subspace Models
(VSMs) for Arabic DID based on 1) lexical features, 2) phonetic
features, and 3) i-vectors. We conduct a thorough feature
selection study of these models to better understand their
interaction. A further contribution of this work is the release
of an Arabic DID system so that others can extend and improve DID
performance on this task.

2. Vector Space Models 

2.1. Senone based Utterance VSM 

Senone refers to an n-gram phone sequence. In our case $n < 4$.
VSM construction takes place in two steps. First, a phoneme
recognizer is used to extract the senone [10] sequence for a given
speech utterance. The phoneme sequence is obtained by
automatic vowelization of the training text, followed by
vowelization to phonetization (V2P). The 36 chosen phonemes cover all
the dialectal Arabic sounds. Further details about the speech
recognition pipeline, training data, and phoneme set are given
in [11]. For the phoneme sequence, we process the phoneme
lattice and obtain the one-best transcription, ignoring silences
as well as noisy silences. Second, each speech utterance ($u$) is
represented as a high-dimensional sparse vector ($\vec{u}$):

$$\vec{u} = \big(A(f(u, s_1)),\ A(f(u, s_2)),\ \ldots,\ A(f(u, s_d))\big) \qquad (1)$$

where $f(u, s_i)$ is the number of times a senone $s_i$ occurs in the
speech utterance $u$, and $A$ is the scaling function. We
experiment with both an identity scaling function and a tf.idf scaling
function, commonly used in the field of Natural Language
Processing [12] to downweight the contribution of the words (in
our case senones) that occur in almost all documents (in our
case utterances), as these words (senones) do not provide any
discriminative information about the documents (utterances).
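The construction above can be sketched in a few lines of Python. This is only an illustrative sketch, not the released system: the function names, the toy phone sequences, and the particular tf.idf variant (count times log(N / document frequency)) are assumptions, since the text does not specify which tf.idf formula was used. With n < 4, phone n-grams up to trigrams are counted.

```python
import math
from collections import Counter

def senone_counts(phones, max_n=3):
    """Count phone n-grams (n = 1..max_n) in one utterance's phone sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(phones) - n + 1):
            counts[tuple(phones[i:i + n])] += 1
    return counts

def build_vsm(utterances, max_n=3, scaling="identity"):
    """Return (dictionary, columns) with one sparse column (dict) per utterance."""
    raw = [senone_counts(u, max_n) for u in utterances]
    dictionary = sorted({s for c in raw for s in c})
    if scaling == "identity":
        return dictionary, [dict(c) for c in raw]
    # Assumed tf.idf variant: raw count * log(N / number of utterances containing s).
    n_utts = len(raw)
    df = Counter(s for c in raw for s in c)
    cols = [{s: f * math.log(n_utts / df[s]) for s, f in c.items()} for c in raw]
    return dictionary, cols

# Toy phone sequences (hypothetical, for illustration only).
utts = [["k", "y", "f"], ["k", "y", "n"]]
dictionary, tfidf_cols = build_vsm(utts, scaling="tfidf")
```

Under this variant, a senone that occurs in every utterance (here the unigram "k") receives weight zero, which is the downweighting behaviour described in the text.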

The vector space is then represented by the matrix
$U_s \in \mathbb{R}^{d \times N}$. The approach and the notation used to define
a VSM are directly inspired by the seminal works in the area of
VSM of Natural Language in [13, 14, 15] and in LID [?].


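$$
U_s =
\begin{array}{c|cccc}
       & u_1 & u_2 & \cdots & u_N \\ \hline
s_1    & A(f(u_1, s_1)) & A(f(u_2, s_1)) & \cdots & A(f(u_N, s_1)) \\
s_2    & A(f(u_1, s_2)) & A(f(u_2, s_2)) & \cdots & A(f(u_N, s_2)) \\
\vdots & \vdots         & \vdots         & \ddots & \vdots         \\
s_d    & A(f(u_1, s_d)) & A(f(u_2, s_d)) & \cdots & A(f(u_N, s_d))
\end{array}
$$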


Figure 1: Senone-based utterance VSM. Column vectors of the 
matrix correspond to the speech utterance vector representation 
formed using equation 1. d is the size of the senone dictionary, 
and N is the total number of speech utterances in the dialectal 
speech database. 

2.2. Word based Utterance VSM 

The word-based utterance VSM ($U_w$) is constructed in two
steps, in a manner similar to the senone features. First, an ASR
system is used to extract the word sequence for each utterance
in the speech database. Details about the ASR system can be
found in [11]. Second, each speech utterance ($u$) is represented as
a high-dimensional sparse vector ($\vec{u}$):

$$\vec{u} = \big(A(f(u, w_1)),\ A(f(u, w_2)),\ \ldots,\ A(f(u, w_d))\big) \qquad (2)$$

where $f(u, w_i)$ is the number of times a word $w_i$ occurs in
the speech utterance $u$, $d$ is here the size of the word
dictionary, and $A$ is the scaling function, which has
the same interpretation as for $U_s$ (above). The vocabulary size
was 55k. The tri-gram dictionary, of size 580k, was used to
construct the word-based VSM.

2.3. i-vector-based Utterance VSM 

2.3.1. Bottleneck Features (BN) 

Recently, bottleneck features extracted from an ASR DNN-based
model were applied successfully to language identification
[29, 30, 28]. In this paper, we use a bottleneck feature
configuration similar to that of our previous ASR-DNN system
for MSA speech recognition [27]. This system is based on two
successive DNN models. Both DNNs use the same setup of
5 hidden sigmoid layers and 1 linear BN layer, and both are
trained with tied states as target outputs. The senone labels
of dimension 3040 are generated by forced alignment from
an HMM-GMM baseline trained on 60 hours of manually
transcribed Al-Jazeera MSA news recordings [11]. The input to the
first DNN consists of 23 critical-band energies obtained
from a Mel filter bank. Pitch and voicing probability are then
added, and 11 consecutive frames are stacked together. The
second DNN is used for correcting the posterior outputs of the
first DNN. In this architecture, the input features of the second
DNN are the outputs of the BN layer of the first DNN.
Context expansion is achieved by concatenating frames with time
offsets of -10, -5, 0, 5, and 10. Thus, the overall time context
seen by the second DNN is 31 frames.
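The 31-frame figure follows from simple arithmetic, which can be checked with the short sketch below. It assumes that the 11 stacked frames of the first DNN form a window centred on the current frame (5 frames on each side); the text does not state this explicitly, and the helper name `context_span` is hypothetical.

```python
def context_span(offsets, frames_per_input):
    """Number of frames covered when centred windows of `frames_per_input`
    frames are concatenated at the given time offsets."""
    half = frames_per_input // 2
    first = min(offsets) - half   # earliest frame seen
    last = max(offsets) + half    # latest frame seen
    return last - first + 1

first_dnn_window = 11                    # frames stacked at the first DNN input
second_dnn_offsets = [-10, -5, 0, 5, 10]  # offsets of concatenated BN outputs
total_context = context_span(second_dnn_offsets, first_dnn_window)
print(total_context)
```

Under this assumption the earliest frame is at -15 and the latest at +15, giving the 31 frames reported above.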

2.3.2. Modeling 

An effective and well-studied method in language and dialect
recognition is the i-vector approach [7, 16, 17]. The i-vector
approach involves modeling speech using a universal background
model (UBM), typically a large GMM, trained on a large amount
of data to represent general feature characteristics. The UBM
acts as a prior on what all dialects look like. The i-vector
approach is a powerful technique that summarizes all the updates
happening during the adaptation of the UBM mean components
to a given utterance. All this information is modeled in a
low-dimensional subspace referred to as the total variability space.
In the i-vector framework, each speech utterance can be
represented by a GMM supervector, which is assumed to be
generated as follows:

$$M = u + Tv$$

where $u$ is the channel- and dialect-independent supervector
(which can be taken to be the UBM supervector), $T$ spans a
low-dimensional subspace, and $v$ are the factors that best
describe the utterance-dependent mean offset. The vector $v$ is
treated as a latent variable, with the i-vector being its
maximum-a-posteriori (MAP) point estimate. The subspace matrix $T$ is
estimated using maximum likelihood on a large training dataset.
An efficient procedure for training and for MAP adaptation of
i-vectors can be found in [18]. In this approach, the i-vector is
the low-dimensional representation of an audio recording that can
be used for classification and estimation purposes. In our
experiments, the UBM was a GMM with 2048 components, BN
features were used, and the i-vectors were 400-dimensional.
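For reference, the MAP point estimate has a closed form in the standard i-vector formulation. The expression below is general i-vector background and is not a detail reported in this paper; the symbols $x$ (the utterance), $N(x)$, $\tilde{F}(x)$ and $\Sigma$ are introduced here and do not appear in the text above. It assumes a standard normal prior on $v$.

```latex
% N(x):         block-diagonal matrix of zeroth-order Baum-Welch statistics
%               (per-component occupation counts) of utterance x under the UBM
% \tilde{F}(x): supervector of first-order statistics, centred on the UBM means
% \Sigma:       block-diagonal covariance matrix of the UBM
\hat{v}(x) = \left( I + T^{\top} \Sigma^{-1} N(x)\, T \right)^{-1}
             T^{\top} \Sigma^{-1} \tilde{F}(x)
```

With a 2048-component UBM and 400-dimensional i-vectors, the matrix being inverted is only 400 x 400, which is one reason the extraction stays tractable.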

In order to maximize the discrimination between the different
dialect classes in the i-vector space, we combine Linear
Discriminant Analysis (LDA) and Within-Class Covariance
Normalization (WCNN) [17]. This intersession compensation method has
been used with both SVM [17] and cosine scoring [7].
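One common way to write this compensation step is given below. It is a sketch of the usual formulation rather than the exact recipe used here: the symbols $A$, $B$, $W$, $C$, $n_c$ and $v_i^{c}$ are introduced for illustration, with $v$ now denoting an extracted i-vector.

```latex
% A:       LDA projection matrix (at most C - 1 columns for C classes)
% v_i^{c}: i-th training i-vector of dialect class c (n_c vectors in class c)
\begin{align}
\bar{v}_c &= \frac{1}{n_c} \sum_{i=1}^{n_c} A^{\top} v_i^{c} \\
W &= \frac{1}{C} \sum_{c=1}^{C} \frac{1}{n_c} \sum_{i=1}^{n_c}
     \left( A^{\top} v_i^{c} - \bar{v}_c \right)
     \left( A^{\top} v_i^{c} - \bar{v}_c \right)^{\top} \\
B B^{\top} &= W^{-1} \quad \text{(Cholesky factorization)} \\
v' &= B^{\top} A^{\top} v
\end{align}
```

Since LDA yields at most $C - 1$ discriminant directions, five dialect classes give a projected space of at most 4 dimensions.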

3. Dataset 

3.1. Train Data 

The training corpus was collected from the Broadcast News
domain in four Arabic dialects (EGY, LAV, GLF, and NOR) as
well as MSA. Data recordings were carried out at 16 kHz. The
recordings were segmented to avoid speaker overlap, removing
any non-speech parts such as music and background noise.
More details about the training data can be found in [7].
Although the test database came from the same broadcast domain,
the recording setup is different. The test data was downloaded
directly from the high-quality video server for Aljazeera
(Brightcove) over the period of July 2014 until January 2015, as
part of the QCRI Advanced Transcription Service (QATS) [19].


Data    EGY   GLF   LAV   NOR   MSA   ENG
Train   13    9.5   11    9     10    10
Test    2     2     2     2     2     2


Table 2: Number of hours of speech available for each dialect. 


3.2. Test Data

The test set was labeled using the crowdsourcing platform
CrowdFlower, with the criterion of a minimum of three judges
per file and up to nine judges, or 75% inter-annotator
agreement (whichever comes first). More details about the test set
and the crowdsourcing experiment can be found in [20]. The test
set used in this paper differs from that used in [7] for two
reasons. First, the crowdsourced data is available to reproduce
the results, and thus can be used as a standard test set for Arabic
DID. Second, the new test set has been collected using different
channels and a different recording setup from the training data,
which makes our experiments less sensitive to channel/speaker
characteristics.
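One possible reading of the labeling criterion is sketched below: judgements are collected one at a time, and collection stops once at least three judges have voted and the majority label reaches 75% agreement, or after nine judges. This is an interpretation for illustration only; the function name, its return format, and the exact way agreement is computed are assumptions and may differ from what the crowdsourcing platform actually does.

```python
from collections import Counter

def collect_label(judgements, min_judges=3, max_judges=9, agreement=0.75):
    """Return (majority_label, judges_used, agreed) for one audio file."""
    votes = Counter()
    for n, label in enumerate(judgements[:max_judges], start=1):
        votes[label] += 1
        top_label, top_count = votes.most_common(1)[0]
        # Stop early once enough judges agree on the same label.
        if n >= min_judges and top_count / n >= agreement:
            return top_label, n, True
    if not votes:
        return None, 0, False
    # Judge budget exhausted without reaching the agreement threshold.
    top_label, _ = votes.most_common(1)[0]
    return top_label, sum(votes.values()), False

result = collect_label(["EGY", "MSA", "EGY", "EGY"])
```

Files that exhaust the judge budget without agreement correspond to segments that are "not identified with enough confidence", such as those mentioned in the discussion of code-switching.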

The train and test data can be found on the QCRI web portal*.
Table 2 and Table 3 present some statistics about the train
and the test data.


Data    EGY    GLF    LAV    NOR    MSA    ENG
Train   1720   1907   1059   1934   1820   1649
Test    315    348    238    355    265    452


Table 3: Number of speech utterances for each dialect. 


4. Experiments 

4.1. Choosing the Best Classifier 

We first studied the best classification approach for the DID
task from a set of two generative models, an n-gram language
model [21] and Naive Bayes [22], and two discriminative
classifiers, a linear SVM [23] and Maximum Entropy [24]. We
measured the performance of each model on the DID task in the
word or lexical-based utterance vector space, which is
constructed using the approach described in section 2, using the
identity scaling function $A$ and performing no dimensionality
reduction. Hence, the dimensionality of an utterance vector,
$\vec{u}$, is the same as the size of the lexicon, which in our
case was 55k. Results can be seen in Table 4. As the linear SVM
performs the best, it is our choice of classifier for the rest of
the experiments.

4.2. Feature Selection Study 

Here we examine the dialect information captured by the three 
utterance VSMs explained in section 2. We also explore the 
concatenation of the utterance vector representations, and report 
the results in Tables 5 and 6. Details about the terms in the 
results table are given below: 

• $U_w$: Refers to the utterance VSM in which each
utterance is represented by a vector given by equation 2,
where $A$ is chosen to be the identity function. The bases
of the vectors are the words in the lexicon. SVD is
used to reduce the dimensionality of the utterance vector
space from the original 55k to 300, 600, 1200, and 1600,
at which point the gain in classification performance
tends to saturate.

• $U_w^{tfidf}$: Same as the previous utterance VSM, except
that $A$ is chosen to be tf.idf [12] instead of the identity
function, which gives a significant improvement in accuracy
over the previous vector space.

• $U_s$: Refers to the utterance VSM in which each
utterance is represented by a vector given by equation 1,
where $A$ is chosen to be the identity function. The utterance
vector bases correspond to senones. Just as with the
word-based utterance VSM, we use SVD on the vector
space and experiment with different dimensions. The
utterance vector space constructed using senone features
is more discriminative than the word-based vector space.

• $U_s^{tfidf}$: Refers to the same vector space as the previous
one, except that $A$ is chosen to be the tf.idf function.
tf.idf does not help in the case of senone features.

• Feature Combination: Combining the best senone-based
utterance VSM, $U_s$ (600d), and the best lexical-based
utterance VSM, $U_w^{tfidf}$ (1200d), to form a
concatenated feature vector representation. SVD is
performed to reduce the dimensions of the feature space.
Feature combination does not help, and hence we
conclude that the two vector spaces are capturing similar
information.

• $U_{iVec}$: Refers to the utterance VSM where each
utterance is represented by a compact 400d i-vector
(section 2.3). We use the bottleneck features to train the
UBM, which is then used to extract the i-vector. We
do not experiment with different i-vector dimensions and
take the best dimension reported in [17] for the LID task.
The i-vector feature space is significantly more
discriminative than the previously defined feature spaces.

• $U_{iVec+LDA+WCNN}$: Reducing the dimensionality of
the i-vector space using LDA and performing WCNN
has been reported to do well in LID tasks [17]. We use
the same technique and see a significant improvement in
the DID results.

• $U_{iVec+LDA+WCNN} + U_s$ (600d): Finally, we
concatenate the best senone-based VSM with the best
i-vector-based VSM to form a concatenated vector
representation for each utterance, and see slight improvements
in the results. As the lexical and senone-based
representations encode the same information about the dialect,
we do not experiment with concatenated lexical and
i-vector representations.


Model                    ACC     PRC     RCL
n-gram Language Model    40.4%   40.2%   41.3%
Naive Bayes              37.9%   37.5%   50.2%
Max Ent                  40%     40%     40.6%
SVM                      45.2%   44.8%   45.4%

Table 4: Performance of different classifiers using lexical
features, with lexicon size of 55K. ACC, PRC and RCL correspond
to accuracy, precision and recall on the test set.

* http://alt.qcri.org/resources/ArabicDialectIDCorpus/





                      d = 300             d = 600             d = 1200            d = 1600
                      ACC   PRC   RCL     ACC   PRC   RCL     ACC   PRC   RCL     ACC   PRC   RCL
U_w                   38.3  41.9  39.4    41.7  44.1  42.8    42.9  45.6  44      42.9  45    43.8
U_w^tfidf             43.3  42.7  43.5    44.6  44    44.9    45.5  45.1  45.8    21.9  20.9  21.9
U_s                   45.2  44.8  45.9    45.8  45.1  46.5    45.2  44.7  45.8    -     -     -
U_s^tfidf             44    43.9  44.7    44    44.2  44.6    43.9  44    44.3    -     -     -
Feature Combination   44.8  44.2  45.6    44.1  43.4  44.8    44.8  44.1  45.4    -     -     -





Table 5: Accuracy, Precision and Recall for different senone and lexical feature based Vector Spaces; d is the dimensionality of
the Vector Space. Boldfaced numbers are the best accuracy for the corresponding vector space and
dimensionality d. A detailed explanation of feature spaces is given in the feature selection study (section 4.2).


Feature Space                    d     ACC    PRC    RCL
U_iVec                           400   55.3   61     55.9
U_iVec+LDA+WCNN                  4     58.5   62.3   58.9
U_iVec+LDA+WCNN+LNORM            4     58.7   61.9   59.3
U_iVec+LDA+WCNN + U_s (600d)     604   59.2   62.7   59.5


Table 6: Accuracy, Precision and Recall for different i-vector
based feature spaces; d refers to the dimensionality of the
Vector Space. A detailed explanation of feature spaces is given in
the feature selection study (section 4.2).



         EGY     GLF     LAV     MSA     NOR     Total Truth   PRC
EGY      221     15      57      13      9       315           50.3%
GLF      45      121     82      12      5       265           55.8%
LAV      74      43      199     18      14      348           46.9%
MSA      19      17      20      218     5       279           77%
NOR      80      21      66      22      166     355           83.4%
#class   439     217     424     283     199
RCL      70.2%   45.7%   57.2%   78.1%   46.8%


Table 7: Confusion Matrix for DID. 
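The precision, recall and overall accuracy figures can be recomputed directly from the counts in Table 7. The sketch below assumes that rows are the true dialects and columns are the predicted dialects, which is consistent with the "Total Truth" column matching the row sums; the function and variable names are illustrative.

```python
labels = ["EGY", "GLF", "LAV", "MSA", "NOR"]

# Rows: true dialect; columns: predicted dialect (counts from Table 7).
confusion = [
    [221, 15, 57, 13, 9],
    [45, 121, 82, 12, 5],
    [74, 43, 199, 18, 14],
    [19, 17, 20, 218, 5],
    [80, 21, 66, 22, 166],
]

def metrics(matrix):
    """Per-class precision and recall, plus overall accuracy."""
    n = len(matrix)
    row_totals = [sum(row) for row in matrix]  # utterances per true class
    col_totals = [sum(matrix[i][j] for i in range(n)) for j in range(n)]  # per predicted class
    correct = [matrix[i][i] for i in range(n)]
    recall = [correct[i] / row_totals[i] for i in range(n)]
    precision = [correct[j] / col_totals[j] for j in range(n)]
    accuracy = sum(correct) / sum(row_totals)
    return precision, recall, accuracy

precision, recall, accuracy = metrics(confusion)
for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision {100 * p:.1f}%, recall {100 * r:.1f}%")
print(f"accuracy {100 * accuracy:.1f}%")
```

The diagonal sums to 925 correct decisions out of 1562 utterances, which reproduces the 59.2% accuracy reported for the concatenated i-vector and senone feature space.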


4.3. One Vs All Classification (Sanity Check)

We constructed a senone-based utterance VSM (section 2.1)
based on 20 hours of speech: 10 hours of English (which we got
from [?]) and 10 hours of Arabic (randomly sampled from our
training data, section 3). Binary classification (English vs
Arabic) using an SVM classifier was then performed, and it yielded
100% accuracy on the 1.5 hour test set. The reason for choosing
the senone-based feature space and not the i-vector-based
feature space for classification is to avoid channel mismatch, as
the English data came from a different source domain. We did a
similar experiment to classify MSA versus all dialectal Arabic
and again obtained 100% classification accuracy.

4.4. System Output Combination 

We fused the scores of the best senone system and the
SVM-based i-vector system. In the fusion step, the original scores
of each system were normalized and combined using the same
fusion weights for both systems. This approach yielded a final
accuracy of 60.2%, which is the best performance we achieved.
One explanation for this gain is that the error patterns for the
two feature spaces are quite different, and we were able to
confirm that by analyzing the confusion matrix for each system.
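The fusion step can be illustrated with a small sketch. The text does not say which normalization was applied, so z-normalization of each system's per-class scores is assumed here purely for illustration; the function names and the toy scores are hypothetical, and equal weights correspond to "the same fusion weights for both systems".

```python
import statistics

def znorm(scores):
    """Shift and scale one system's class scores to zero mean, unit variance."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    return [(s - mean) / std if std > 0 else 0.0 for s in scores]

def fuse(system_scores, weights=None):
    """Weighted sum of normalized scores; one score list per system."""
    if weights is None:
        weights = [1.0 / len(system_scores)] * len(system_scores)
    normed = [znorm(s) for s in system_scores]
    n_classes = len(normed[0])
    return [sum(w * s[c] for w, s in zip(weights, normed)) for c in range(n_classes)]

labels = ["EGY", "GLF", "LAV", "MSA", "NOR"]
senone_scores = [0.2, 0.1, 0.9, 0.0, 0.3]  # toy SVM scores, senone system
ivector_scores = [5.0, 1.0, 4.0, 0.0, 2.0]  # toy SVM scores, i-vector system
fused = fuse([senone_scores, ivector_scores])
decision = labels[fused.index(max(fused))]
```

In this toy case the i-vector system alone would pick EGY, while the fused scores pick LAV, showing how systems with different error patterns can correct each other.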

5. Discussion 

We infer from the confusion matrix in Table 7 that GLF and
LAV are the most confusable dialect pair. We believe that this
is related to the greater lexical similarity between these two
dialects (see Table 1). Note that the confusion matrix is from the
best DID system. We borrowed Table 8 from previous work [20] on
the test set; it shows how often the same speakers switch
between one dialect and another (mainly MSA and their
own native dialect). For example, in the second row of Table 8,
there are 200 samples from potential Gulf speakers. After
manual labeling, 106 (53%) segments were labeled as MSA,
82 (41%) were validated as GLF, 8 (4%) as LAV, and 4 segments were
not identified with enough confidence to be considered. This
means that more than 50% of the random GLF speakers' data is in
fact MSA speech segments. This is strong evidence for the
amount of code-switching between a dialect and MSA by
the same speaker.


Expected Dialect   EGY   GLF   LAV   NOR   MSA
EGY                65%   -     -     -     32%
GLF                -     41%   4%    -     53%
LAV                1%    1%    53%   -     39%
NOR                1%    -     -     69%   28%


Table 8: Expected dialect of each speech segment from particular
dialectal speakers.

6. Conclusions 

This paper presents our efforts on automatic dialect
identification for Arabic broadcast speech. We have demonstrated a
dialect classifier with an accuracy of 60.2% using system
combination. We also achieved 100% accuracy on two binary
classification tasks: MSA vs Dialectal Arabic and English vs Arabic.
We studied the potential code-switching pattern in our classifier
and its correlation with the manual annotation. Further work
for this research is to study the code-switching between MSA and
dialectal Arabic without considering speaker diarization or
silence between speech segments, in what can be called dialect
diarization. We shall also study deep neural network approaches
to classification to learn a more complex non-linear decision
boundary.